As complex measurements convolve meaningful phenomena with more and more extraneous phenomena, sparsity is becoming an increasingly prevalent objective in statistical analyses. Much of this zeitgeist has been driven by the success of frequentist methods like compressed sensing and LASSO regression, and it is often naively assumed that these properties immediately carry over to the corresponding Bayesian analyses. Sparsity in a Bayesian analysis, however, is induced by fundamentally different properties than those that induce sparsity in a frequentist analysis.
In this case study I’ll review how sparsity arises in frequentist and Bayesian analyses and discuss the often subtle challenges in implementing sparisty in practical Bayesian analyses.
The usual setting for sparsity is a regression over a data set containing many covariates that could possibly correlate with an outcome variate of interest. In this circumstance sparsity is the assumption that only a few of these covariates have any meaningful correlation, although a priori we do not know exactly which covariates will be relevant and which will be irrelevant. A sparse regression encodes this assumption to allow the data to inform both which covariates are relevant and how the relevant covariates then correlate with the outcome variate.
For example we might be interested in classifying individuals in a population into two groups, with many individual characteristics possibly influencing the probabilty of being associated with each group. Without any sparsity assumptions the uncertainty in the irrelevant characterisics will propagate to large uncertainties in inferred associations. If we can isolate only the relevant covariates, however, then we can significantly reduce these uncertainties and improve our inferences.
For simplicity I will restrict the discussion to only general linear models where slopes, \(\boldsymbol{\beta}\), control the influence of each covariate through some effective parameter for each observation, \[ \theta_{n} (\mathbf{x}_{n}, \boldsymbol{\beta}, \alpha) = f ( \mathbf{x}_{n} \cdot \boldsymbol{\beta} + \alpha ). \] Because a covariate will no longer influence the model when the corresponding slope vanishes, sparsity is induced by regularizing the irrelevant slopes towards zero without affecting the relevant slopes and compromising inferences too much.
How this sparsity is realized depends intimately on the inferential perspective that we consider.
Frequentist sparsity is a decision making process that explicitly selects a small subset of covariates to inform the regression while discarding the rest. When executed properly this selection process can drastically reduce the size of the data that needs to be considered and hence the computational cost of making predictions for new observations. When executed improperly the selection process can undermine our ability to infer the variate outcome.
In the general linear model setting the selection of covariates is equivalent to identifying which slopes are zero and which are non-zero. Methods like the LASSO (Tibshirani 1996) achieve sparsity by strongly penalizing the maximum likelihood estimator of the slopes towards small values and then explicitly forcing a slope to zero when it falls below a given threshold during the maximum likelihood estimation. As the optimization proceeds any slope that isn’t needed to maintain the maximum likelihood fit to the observed data is forced to zero and utlimately discarded.
For sufficiently simple data generating processes and large enough data sets these methods tend to be reasonably well-calibrated. We will, on average, discard most of the irrelevant covariates while retaining most of the relevant covariates, and our abilty ot model the variate outcome, regardless of the exact data that we observe.
Bayesian sparsity avoids decisions altogether and instead manifests as a disctint behavior in the posterior distribution. In particular, Bayesian sparsity arises from a posterior distribution that strongly concentrates around the neighborhood in parameter space where the irrelevant slopes are zero and hence inconsequential when making inferences.
As the penalty function induces sparse decisions in the frequentist setting, the prior distribution induces sparse inferences in the Bayesian setting. How exactly, however, can a prior distribution coerce the desired concentration when we don’t know which slopes are irrelevant a priori?
It’s tempting to appeal to the penalty function that is so critical to the success in the frequentist setting. In particular, if we reinterpret the sparsity-inducing penalty functions as a log probability density over parameter space then does that define a sparsity-inducing prior distribution? Unfortunately it does not. In fact the implied prior distibutions stretch the corresponding posterior distribution away from the desired neighborhood where the irrelevan slopes vanish.
The problem is that a sparsity-inducing penalty function has to influence only a single point in parameter space at any given time, whereas a sparsity-inducing prior distribution has to consider the entire parameter space at once.
To highlight the difference between inducing sparsity in a point estimator and inducing sparsity in an entire distribution, let’s consider the \(L_{1}\) penalty at the heart of the LASSO, \[ R_{L_{1}} ( \boldsymbol{\beta} ) = \sum_{m = 1}^{M} \lambda_{m} \left| \beta_{m} \right|. \]
When the maximum likeihood estimate of the slope \(\beta_{m}\) falls below the scale \(\lambda_{m}\) it is regularized towards zero but, because the penalty is nearly flat above the scale, estimates above \(\lambda_{m}\) experience negligible regularization. Given a suitable choice of scales for each covariate this dichotomous behavior of the penalty facilitates the suppresion of the irrelevant slopes below the selection treshold while leaving the relevant slopes undisturbed.
Interpreted as a negative log density, the \(L_{1}\) penalty function implies independent Laplace priors for each of the slopes, \[ \pi( \boldsymbol{\beta} ) = \exp (- R_{L_{1}} ( \boldsymbol{\beta} )) = \prod_{m = 1}^{M} \exp (- \lambda_{m} \left| \beta_{m} \right| ). \] With this prior the mode of the resulting posterior distribution will coincide with the penalized maximum likelihood estimator; unfortunately the mode is not a well-posed inference drawn from the posterior distribution. Proper inferences correspond instead of posterior expectation values that are informed by the entire posterior distribution. The affect of the Laplace prior on the full posterior distribution is not nearly as useful as its affect on the mode.
Because the maximum likelihood estimator considers only a single point in parameter space at a time, it is influenced by the either the regularizing behavior of the penalty below each \(\lambda_{m}\) or the laissez faire behavior above each \(\lambda_{m}\), but not both. The expanse of the posterior distribution, however, is influenced by both of these behaviors at the same time. While the shape of the Laplace prior below \(\lambda_{m}\) does induce some concentration of the posterior towards smaller values of \(\beta_{m}\), the heavy tail also drags significant posterior probabilty far above \(\lambda_{m}\).
These opposing behaviors induce regrettable features in the posterior for both the irrelevant slopes, which leak significant probability mass towards undesired large values, and relevant slopes, which are overshrunk towards smaller values. The former prevents us from identifying the irrelevant parameters while the latter biases our inferences. Moreover, in both cases the marginal posteriors become much more diffuse than warranted which inflates our inferential uncertainties.
If we want to induce sparsity in Bayesian decisions then we need a prior distribution with a more holistic approach to regularization. Luckily enough, we have the horseshoe.
To implement Bayesian sparsity we need a prior distribution that allows the data to collapse the entire marginal posterior for each slope towards relevance or irrelevance, but not both. In doing so we need to carefully define exactly what values of the slopes are consequential to the final inference and what values are not.
In order to provide the desired flexibility in the posterior for each slope we need a prior distribution that enforces a global scale while also giving each of slopes the flexibilty to transcend that scale as needed. Because we don’t know which slopes will need that flexibilty the desired prior will have to be exchangeable with respect to the slopes and hence manifest a hierachical structure.
The horseshoe prior (Carvalho, Polson, and Scott 2009) accomplishes this flexibility by setting the scale for each component to the product of a global scale, \(\tau\) and a local scale,\(\lambda_{m}\), each of which are themselves unknown parameters, \[ \begin{align*} \beta_{m} &\sim \mathcal{N} (0, \tau \cdot \lambda_{m}) \\ \lambda_{m} &\sim \text{Half-}\mathcal{C} (0, 1) \\ \tau &\sim \text{Half-}\mathcal{C} (0, \tau_{0}). \end{align*} \] The heavy-tailed Cauchy prior distribution for the local scales allows the data to push each to large values as needed, which then push the corresponding slopes above the global scale, \(\tau_{0}\). By making the global scale itself a parameter we also allow the data to refine the scale beyond our prior judgement of \(\tau_{0}\).
An immediate problem with the horseshoe prior is that the slopes that transcend the global scale are otherwise unregularized, leaving their posteriors to diffuse to extremely large values. This behavior allows nonidentification or weak identification of the likelihood to propagate to the posterior and compromise the validity of the resulting inferences. The Finnish horseshoe (Piironen and Vehtari 2017) remedies this vulnerability by introducing another level to the prior hierarchy, \[ \begin{align*} \beta_{m} &\sim \mathcal{N} (0, \tau \cdot \tilde{\lambda}_{m}) \\ \tilde{\lambda}_{m} &= \frac{c \lambda_{m}} {\sqrt{ c^{2} + \tau^{2} \lambda_{m}^{2}}} \\ \lambda_{m} &\sim \text{Half-}\mathcal{C} (0, 1) \\ c^{2} &\sim \text{Inv-}\mathcal{G} \, (\frac{\nu}{2}, \frac{\nu}{2} s^{2}) \\ \tau &\sim \text{Half-}\mathcal{C} (0, \tau_{0}). \end{align*} \] Integrating the new scale, \(c\), out of the distribution implies a marginal \(\text{Student-}t \, (\nu, 0, s)\) prior for each of the slopes, at least sufficiently far above the glboal treshold \(\tau_{0}\). Setting \(\nu\) and \(s\) appropriately then ensures that we can contain with the posterior within a few multiples of \(s\) around zero.
With the shape of a sparsity-inducing prior established we are left only with determining the prior hyperparameter \(\tau_{0}\) which effetively determines the scale below which slopes are irrelevant to the modeling of the output variate. The subtlety with specifying \(\tau_{0}\) is irrelevance is determined not by the prior distribution itself but rather our measurement process – the contribution of a slope is negligible only when it is indistinguishable from the inherent variabiltiy of our observations. As always the consequences of the prior depend on the context of the likelihood (Gelman, Simpson, and Betancourt 2017).
Let’s presume that we will model our measurement process as a linear regression, \[ y_{n} \sim \mathcal{N} (\mathbf{x}_{n} \cdot \boldsymbol{\beta} + \alpha, \sigma). \] The measurement variability, \(\sigma\) provides a natural scale for irrelevance in this model.
If we assume that the covariates are independent and distributed around zero with unit variation then a slope on the order of \(\sigma\) will yield an individual response, \(x_{nm} \beta_{m} \approx \sigma\), that would be hard to discriminate from the expected variation of the measurement itself a priori. Consequently \(\tau_{0} = \sigma\) seems a reasonable assumption.
When we consider the posterior behavior, however, we have to recognize that the data will typically inform our inferences beyond the scale of the measurement variability – with \(N\) independent observations we will be sensitive to effects as small as \(\sigma / \sqrt{N}\). If we ignore the number of observations then the consequnces of the resulting prior will change with the size of the data! This suggests instead that we take \(\tau_{0} = \sigma / \sqrt{N}\).
Our logic so far has relied on the expected contribution from each slope individually. That tails of the horseshoe priors, however, are heavy enough that when we consider the contribution from many slopes at once we will see that at least a few will exceed the irrelevance threshold and significantly effect the variate-covariate relationship. In order to avoid this ensemble behavior we have to reduce our prior scale even further to \[ \tau_{0} = \frac{m_{0}}{M - m_{0}} \frac{\sigma}{\sqrt{N}}, \] where \(m_{0}\) is the expected number of relevant slopes.
If we move away from pure linear regression then this argument has to be modified to account for nonlinearities in the measurement process. Piironen and Vehtari (2017) derives approximate scales appropriate to general linear models. In order to faciliate the optimal performance of the horseshoe family of prior distributions we must take care to ensure that the prior scale is compatible with the measurement process.
To demonstrate the subtlties in implementing sparsity in a Bayesian analysis let’s consider data simulated from a linear regression model,
writeLines(readLines("generate_data.stan"))transformed data {
int<lower=0> M = 200;
int<lower=0> N = 100;
real alpha = 3;
real sigma = 1;
real<lower=0, upper=1> sig_prob = 0.05;
}
generated quantities {
matrix[M, N] X;
real y[N];
vector[M] beta;
for (m in 1:M) {
if (bernoulli_rng(sig_prob))
if (bernoulli_rng(0.5))
beta[m] = normal_rng(10, 1);
else
beta[m] = normal_rng(-10, 1);
else
beta[m] = normal_rng(0, 0.25);
}
for (n in 1:N) {
for (m in 1:M)
X[m, n] = normal_rng(0, 1);
y[n] = normal_rng(X[1:M,n]' * beta + alpha, sigma);
}
}
The covariates are all independently distributed around zero with unit variance, and there are a population of large slopes and small, irrelevant slopes. Moreover, the data are collinear with more covariates than observations.
We begin by simulating a measurement,
library(rstan)Loading required package: ggplot2
Loading required package: StanHeaders
rstan (Version 2.17.2, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
Attaching package: 'rstan'
The following objects are masked _by_ '.GlobalEnv':
check_energy, check_treedepth
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
util <- new.env()
source('stan_utility.R', local=util)
source('plot_utility.R', local=util)
fit <- stan(file='generate_data.stan', iter=1,
chains=1, seed=194838, algorithm="Fixed_param")
SAMPLING FOR MODEL 'generate_data' NOW (CHAIN 1).
Iteration: 1 / 1 [100%] (Sampling)
Elapsed Time: 0 seconds (Warm-up)
0.002368 seconds (Sampling)
0.002368 seconds (Total)
X <- extract(fit)$X[1,,]
y <- extract(fit)$y[1,]
N <- dim(X)[2]
M <- dim(X)[1]
beta_true <- extract(fit)$beta[1,]
stan_rdump(c("N", "M", "X", "y", "beta_true"), file="linear_regression.data.R")Only 7 of the simulated slopes are large enough to significantly effect the regression, with the other slopes left largely irrelevant.
input_data <- read_rdump("linear_regression.data.R")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
par(mar = c(4, 4, 0.5, 0.5))
hist(input_data$beta_true, main="", col=c_dark, border=c_dark_highlight,
xlab="True Slopes", yaxt='n', ylim=c(0, 40), ylab="",
breaks=11*(-100:100)/100)Our ability to recover these large slopes will depend intimately on the prior we incorporate into our Bayesian model.
Let’s begin with a uniform prior over the slopes,
writeLines(readLines("linear_regression_unif.stan"))data {
int<lower=1> N; // Number of data
int<lower=1> M; // Number of covariates
matrix[M, N] X;
real y[N];
}
parameters {
vector[M] beta;
real alpha;
real<lower=0> sigma;
}
model {
// No priors on the slopes
alpha ~ normal(0, 2);
sigma ~ normal(0, 2);
y ~ normal(X' * beta + alpha, sigma);
}
fit <- stan(file='linear_regression_unif.stan', data=input_data, seed=4938483)Warning: There were 4000 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 3 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
Because of the collinear data the linear regression likelihood is non-identified, and the uniform prior allows that non-identifiabilty to propagate to the posterior. The fit of the resulting posterior unsurpsingly fails in spectacular fashion, with vanishing effective sample sizes, large \(\hat{R}\), and failing HMC diagnostics.
capture.output(util$check_n_eff(fit))[1:5][1] "[1] \"n_eff / iter for parameter beta[1] is 0.00083754388792516!\""
[2] "[1] \"n_eff / iter for parameter beta[2] is 0.000508390188710226!\""
[3] "[1] \"n_eff / iter for parameter beta[3] is 0.000507794720866011!\""
[4] "[1] \"n_eff / iter for parameter beta[4] is 0.000530267296523219!\""
[5] "[1] \"n_eff / iter for parameter beta[5] is 0.000531620324942505!\""
capture.output(util$check_rhat(fit))[1:5][1] "[1] \"Rhat for parameter beta[1] is 3.52753512963361!\""
[2] "[1] \"Rhat for parameter beta[2] is 10.4072515792808!\""
[3] "[1] \"Rhat for parameter beta[3] is 13.2554369466892!\""
[4] "[1] \"Rhat for parameter beta[4] is 6.84599937631635!\""
[5] "[1] \"Rhat for parameter beta[5] is 7.13733173946251!\""
util$check_div(fit)[1] "0 of 4000 iterations ended with a divergence (0%)"
util$check_treedepth(fit)[1] "4000 of 4000 iterations saturated the maximum tree depth of 10 (100%)"
[1] " Run again with max_depth set to a larger value to avoid saturation"
util$check_energy(fit)[1] "Chain 1: E-BFMI = 0.141974019845751"
[1] "Chain 3: E-BFMI = 0.161669948033842"
[1] " E-BFMI below 0.2 indicates you may need to reparameterize your model"
The fit does, however, exhibit some of the massive uncertainty inherent to the non-identified posterior,
util$plot_post_quantiles(fit, input_data, "Uniform Prior")util$plot_residual_quantiles(fit, input_data, "Uniform Prior")util$plot_aux_posteriors(fit, "Uniform Prior")We definitley need a prior to compensate for the non-identified likelihood, but just how much prior informationt do we need? Let’s try a weakly-informative prior for all of the slopes that strongly concentrates below the scale of the measurement variability.
writeLines(readLines("linear_regression_narrow.stan"))data {
int<lower=1> N; // Number of data
int<lower=1> M; // Number of covariates
matrix[M, N] X;
real y[N];
}
parameters {
vector[M] beta;
real alpha;
real<lower=0> sigma;
}
model {
// Strongly regularizing priors on the slopes
beta ~ normal(0, 1);
alpha ~ normal(0, 2);
sigma ~ normal(0, 2);
y ~ normal(X' * beta + alpha, sigma);
}
fit <- stan(file='linear_regression_narrow.stan', data=input_data, seed=4938483)The fit is much better behaved,
util$check_all_diagnostics(fit)[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat looks reasonable for all parameters"
[1] "0 of 4000 iterations ended with a divergence (0%)"
[1] "0 of 4000 iterations saturated the maximum tree depth of 10 (0%)"
[1] "E-BFMI indicated no pathological behavior"
But the extreme regularization biases the marginal posteriors for the larger slopes far below their true values,
util$plot_post_quantiles(fit, input_data, "Narrow Prior")util$plot_residual_quantiles(fit, input_data, "Narrow Prior")util$plot_aux_posteriors(fit, "Narrow Prior")Would a weakly informative prior work if we expanded the scale so to encompass the breadth of the true slopes?
writeLines(readLines("linear_regression_wide.stan"))data {
int<lower=1> N; // Number of data
int<lower=1> M; // Number of covariates
matrix[M, N] X;
real y[N];
}
parameters {
vector[M] beta;
real alpha;
real<lower=0> sigma;
}
model {
// Weakly regularizing priors on the slopes
beta ~ normal(0, 10);
alpha ~ normal(0, 2);
sigma ~ normal(0, 2);
y ~ normal(X' * beta + alpha, sigma);
}
fit <- stan(file='linear_regression_wide.stan', data=input_data, seed=4938483)Warning: There were 4 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 3703 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
The fit diagnostics hint at trouble, which is not unexpected given that the wider prior allows more of the likelihood non-identifiabilty to spread to the posterior.
util$check_all_diagnostics(fit)[1] "n_eff / iter looks reasonable for all parameters"
[1] "Rhat for parameter sigma is 1.32764377423625!"
[1] "Rhat for parameter lp__ is 1.4036971622549!"
[1] " Rhat above 1.1 indicates that the chains very likely have not mixed"
[1] "4 of 4000 iterations ended with a divergence (0.1%)"
[1] " Try running with larger adapt_delta to remove the divergences"
[1] "3703 of 4000 iterations saturated the maximum tree depth of 10 (92.575%)"
[1] " Run again with max_depth set to a larger value to avoid saturation"
[1] "Chain 1: E-BFMI = 0.0948802299555897"
[1] "Chain 2: E-BFMI = 0.05202893542316"
[1] "Chain 3: E-BFMI = 0.0495409309783256"
[1] "Chain 4: E-BFMI = 0.0730791546596158"
[1] " E-BFMI below 0.2 indicates you may need to reparameterize your model"
More importantly, the wide prior offers little regularization to the many slopes that are expected to be negligible. The resulting marginal posteriors exhibit little contraction away from the prior.
util$plot_post_quantiles(fit, input_data, "Wide Prior")util$plot_residual_quantiles(fit, input_data, "Wide Prior")util$plot_aux_posteriors(fit, "Wide Prior")If want to isolate the significant slopes then we have to encode the assumption of sparsity into our prior distribution. Before considering the family of horseshoe priors, however, let’s take a look at the Laplace prior motivated by a naive translation of the frequentist LASSO.
writeLines(readLines("linear_regression_laplace.stan"))data {
int<lower=1> N; // Number of data
int<lower=1> M; // Number of covariates
matrix[M, N] X;
real y[N];
}
parameters {
vector[M] beta;
real alpha;
real<lower=0> sigma;
}
model {
// Strongly regularizing priors
beta ~ double_exponential(0, 1);
alpha ~ normal(0, 2);
sigma ~ normal(0, 2);
y ~ normal(X' * beta + alpha, sigma);
}
fit <- stan(file='linear_regression_laplace.stan', data=input_data, seed=4938483,
control=list(adapt_delta=0.99, max_treedepth=12))Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
Even with the large treedepths and small adaptation target, adapt_delta, the fit with the Laplace prior exhibits slow mixing.
util$check_n_eff(fit)[1] "n_eff / iter looks reasonable for all parameters"
util$check_rhat(fit)[1] "Rhat for parameter sigma is 1.15226728483963!"
[1] "Rhat for parameter lp__ is 1.21462867500161!"
[1] " Rhat above 1.1 indicates that the chains very likely have not mixed"
util$check_div(fit)[1] "0 of 4000 iterations ended with a divergence (0%)"
util$check_treedepth(fit, 12)[1] "0 of 4000 iterations saturated the maximum tree depth of 12 (0%)"
util$check_energy(fit)[1] "Chain 1: E-BFMI = 0.0510589671576194"
[1] "Chain 2: E-BFMI = 0.0576039171062086"
[1] "Chain 3: E-BFMI = 0.0283443796848285"
[1] "Chain 4: E-BFMI = 0.0309561682185903"
[1] " E-BFMI below 0.2 indicates you may need to reparameterize your model"
The posterior seems much better behaved compared to our failures up to this point, but we do see somewhat large uncertainties and excessive regularization of the larger slopes as expected from our previous discussion.
util$plot_post_quantiles(fit, input_data, "Laplace Prior")util$plot_residual_quantiles(fit, input_data, "Laplace Prior")util$plot_aux_posteriors(fit, "Laplace Prior")In order to encode proper inferential sparsity into our Bayesian analysis we need to consider the horseshoe or variants thereof.
writeLines(readLines("linear_regression_horseshoe.stan"))data {
int<lower=1> N; // Number of data
int<lower=1> M; // Number of covariates
matrix[M, N] X;
real y[N];
}
parameters {
vector[M] beta_tilde;
vector<lower=0>[M] lambda;
real<lower=0> tau_tilde;
real alpha;
real<lower=0> sigma;
}
model {
// tau ~ cauchy(0, sigma)
// beta ~ normal(0, tau * lambda)
vector[M] beta = beta_tilde .* lambda * sigma * tau_tilde;
beta_tilde ~ normal(0, 1);
lambda ~ cauchy(0, 1);
tau_tilde ~ cauchy(0, 1);
alpha ~ normal(0, 2);
sigma ~ normal(0, 2);
y ~ normal(X' * beta + alpha, sigma);
}
fit <- stan(file='linear_regression_horseshoe.stan', data=input_data, seed=4938483,
control=list(adapt_delta=0.99, max_treedepth=15))Warning: There were 138 divergent transitions after warmup. Increasing adapt_delta above 0.99 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
Warning: There were 998 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
Warning: There were 4 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
Unfortunately, the fit isn’t great. While the horseshoe does have a reputation for being challenging to fit, its lack of regularization of the large slopes is particularly troublesome in the context of the non-identified likelihood.
util$check_n_eff(fit)[1] "n_eff / iter looks reasonable for all parameters"
util$check_rhat(fit)[1] "Rhat for parameter tau_tilde is 1.1297586390972!"
[1] "Rhat for parameter lp__ is 1.115677574797!"
[1] " Rhat above 1.1 indicates that the chains very likely have not mixed"
util$check_div(fit)[1] "138 of 4000 iterations ended with a divergence (3.45%)"
[1] " Try running with larger adapt_delta to remove the divergences"
util$check_treedepth(fit, 15)[1] "998 of 4000 iterations saturated the maximum tree depth of 15 (24.95%)"
[1] " Run again with max_depth set to a larger value to avoid saturation"
util$check_energy(fit)[1] "Chain 1: E-BFMI = 0.0713483330532017"
[1] "Chain 2: E-BFMI = 0.0446199194337883"
[1] "Chain 3: E-BFMI = 0.117955902178254"
[1] "Chain 4: E-BFMI = 0.0673480425208603"
[1] " E-BFMI below 0.2 indicates you may need to reparameterize your model"
That said, the sparsity induced by the horseshoe prior does allow the posterior to become much more narrow if not quite as accurate as we’d like.
util$plot_post_quantiles(fit, input_data, "Horseshoe Prior")util$plot_residual_quantiles(fit, input_data, "Horseshoe Prior")util$plot_aux_posteriors(fit, "Horeshoe Prior")Finally we can consider incorporating a Finnish horseshoe into our model. Here we assume 10 large slopes and tune the extra level of regularization to place most of the prior mass below a magnitude of 10.
writeLines(readLines("linear_regression_finnish_horseshoe.stan"))data {
int<lower=1> N; // Number of data
int<lower=1> M; // Number of covariates
matrix[M, N] X;
real y[N];
}
// slab_scale = 5, slab_df = 25 -> 8 divergences
transformed data {
real m0 = 10; // Expected number of large slopes
real slab_scale = 3; // Scale for large slopes
real slab_scale2 = square(slab_scale);
real slab_df = 25; // Effective degrees of freedom for large slopes
real half_slab_df = 0.5 * slab_df;
}
parameters {
vector[M] beta_tilde;
vector<lower=0>[M] lambda;
real<lower=0> c2_tilde;
real<lower=0> tau_tilde;
real alpha;
real<lower=0> sigma;
}
transformed parameters {
vector[M] beta;
{
real tau0 = (m0 / (M - m0)) * (sigma / sqrt(1.0 * N));
real tau = tau0 * tau_tilde; // tau ~ cauchy(0, tau0)
// c2 ~ inv_gamma(half_slab_df, half_slab_df * slab_scale2)
// Implies that marginally beta ~ student_t(slab_df, 0, slab_scale)
real c2 = slab_scale2 * c2_tilde;
vector[M] lambda_tilde =
sqrt( c2 * square(lambda) ./ (c2 + square(tau) * square(lambda)) );
// beta ~ normal(0, tau * lambda_tilde)
beta = tau * lambda_tilde .* beta_tilde;
}
}
model {
beta_tilde ~ normal(0, 1);
lambda ~ cauchy(0, 1);
tau_tilde ~ cauchy(0, 1);
c2_tilde ~ inv_gamma(half_slab_df, half_slab_df);
alpha ~ normal(0, 2);
sigma ~ normal(0, 2);
y ~ normal(X' * beta + alpha, sigma);
}
fit <- stan(file='linear_regression_finnish_horseshoe.stan', data=input_data,
seed=4938483,
control=list(adapt_delta=0.99, max_treedepth=15))Warning: There were 2 chains where the estimated Bayesian Fraction of Missing Information was low. See
http://mc-stan.org/misc/warnings.html#bfmi-low
Warning: Examine the pairs() plot to diagnose sampling problems
As with any Finn, the Finnish horseshoe can be a bit stubborn so we have to push the limits of Hamiltonian Monte Carlo a bit to resolve all of the structure it induces in the posterior distribution. We some tuning, however, we recover a good fit.
util$check_n_eff(fit)[1] "n_eff / iter looks reasonable for all parameters"
util$check_rhat(fit)[1] "Rhat looks reasonable for all parameters"
util$check_div(fit)[1] "0 of 4000 iterations ended with a divergence (0%)"
util$check_treedepth(fit, 15)[1] "0 of 4000 iterations saturated the maximum tree depth of 15 (0%)"
util$check_energy(fit)[1] "E-BFMI indicated no pathological behavior"
Excitingly, the Finnish horseshoe admits narrow posteriors around both the large and small slopes without compromising accuracy.
util$plot_post_quantiles(fit, input_data, "Finnish Horseshoe Prior")util$plot_residual_quantiles(fit, input_data, "Finnish Horseshoe Prior")util$plot_aux_posteriors(fit, "Finnish Horeshoe Prior")Sparsity is a potentially powerful but ultimately very subtle assumption. Only by utilizing prior distributions that induce sparsity in a principled manner can we truely exploit the assumption in those circumstances where it is appropriate. In particular, any proper application of sparsity requires a detailed understanding of how the sparse parameters effect the measurement process itself.
I thank Dan Simpson, Aki Vehtari, and Juho Piironen for many helpful conversations about sparsity.
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function -Wno-macro-redefined
CC=clang
CXX=clang++ -arch x86_64 -ftemplate-depth-256
devtools::session_info("rstan")Session info -------------------------------------------------------------
setting value
version R version 3.4.2 (2017-09-28)
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_US.UTF-8
tz America/New_York
date 2018-03-04
Packages -----------------------------------------------------------------
package * version date source
BH 1.65.0-1 2017-08-24 CRAN (R 3.4.1)
colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
dichromat 2.0-0 2013-01-24 CRAN (R 3.4.0)
digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.4.0)
graphics * 3.4.2 2017-10-04 local
grDevices * 3.4.2 2017-10-04 local
grid 3.4.2 2017-10-04 local
gridExtra 2.3 2017-09-09 CRAN (R 3.4.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
inline 0.3.14 2015-04-13 CRAN (R 3.4.0)
labeling 0.3 2014-08-23 CRAN (R 3.4.0)
lattice 0.20-35 2017-03-25 CRAN (R 3.4.2)
lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
MASS 7.3-47 2017-02-26 CRAN (R 3.4.2)
Matrix 1.2-11 2017-08-21 CRAN (R 3.4.2)
methods * 3.4.2 2017-10-04 local
munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0)
Rcpp 0.12.13 2017-09-28 CRAN (R 3.4.2)
RcppEigen 0.3.3.3.0 2017-05-01 CRAN (R 3.4.0)
reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
rlang 0.1.4 2017-11-05 CRAN (R 3.4.2)
rstan * 2.17.2 2017-12-21 CRAN (R 3.4.2)
scales 0.5.0 2017-08-24 CRAN (R 3.4.1)
StanHeaders * 2.17.1 2017-12-20 CRAN (R 3.4.2)
stats * 3.4.2 2017-10-04 local
stats4 3.4.2 2017-10-04 local
stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
tibble 1.3.4 2017-08-22 CRAN (R 3.4.1)
tools 3.4.2 2017-10-04 local
utils * 3.4.2 2017-10-04 local
viridisLite 0.2.0 2017-03-24 CRAN (R 3.4.0)
Carvalho, Carlos M., Nicholas G. Polson, and James G. Scott. 2009. “Handling Sparsity via the Horseshoe.” In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, edited by David van Dyk and Max Welling, 5:73–80. Proceedings of Machine Learning Research. Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA: PMLR. http://proceedings.mlr.press/v5/carvalho09a.html.
Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. “The Prior Can Often Only Be Understood in the Context of the Likelihood.” Entropy 19 (10). doi:10.3390/e19100555.
Piironen, Juho, and Aki Vehtari. 2017. “Sparsity Information and Regularization in the Horseshoe and Other Shrinkage Priors.” Electron. J. Statist. 11 (2). The Institute of Mathematical Statistics; the Bernoulli Society: 5018–51. doi:10.1214/17-EJS1337SI.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological) 58 (1). [Royal Statistical Society, Wiley]: 267–88. http://www.jstor.org/stable/2346178.